98 research outputs found

    Using percolated dependencies for phrase extraction in SMT

    Statistical Machine Translation (SMT) systems rely heavily on the quality of the phrase pairs induced from large amounts of training data. Apart from the widely used method of heuristic learning of n-gram phrase translations from word alignments, there are numerous methods for extracting these phrase pairs. One such class of approaches uses translation information encoded in parallel treebanks to extract phrase pairs. Work to date has demonstrated the usefulness of translation models induced from both constituency structure trees and dependency structure trees. Both syntactic annotations rely on the existence of natural language parsers for both the source and target languages. We depart from the norm by directly obtaining dependency parses from constituency structures using head percolation tables. The paper investigates the use of aligned chunks induced from percolated dependencies in French–English SMT and contrasts it with the aforementioned extracted phrases. We observe that adding phrase pairs from any other method improves translation performance over the baseline n-gram-based system, percolated dependencies are a good substitute for parsed dependencies, and that supplementing with our novel head percolation-induced chunks shows a general trend toward improving all system types across two data sets up to a 5.26% relative increase in BLEU

    MaTrEx: the DCU machine translation system for ICON 2008

    In this paper, we give a description of the machine translation system developed at DCU that was used for our participation in the NLP Tools Contest of the International Conference on Natural Language Processing (ICON 2008). This was our first ever attempt at working on any Indian language. In this participation, we focus on various techniques for word and phrase alignment to improve system quality. For the English-Hindi translation task we exploit source-language reordering. We also carried out experiments combining both in-domain and out-of-domain data to improve the system performance and, as a post-processing step we transliterate out-of-vocabulary items

    A Partition Based Novel Approach in AFIS for Forensics & Security

    Many automatic fingerprint identification approaches have been suggested. Amongst those various methods, minutiae-based fingerprint representation and recognition is widely used. Minutiae-based representation has some disadvantages in comparison to other fingerprint approaches in terms of sample size. This paper describes a novel concept of partition based approach for fingerprint identification which aims to improve error rates as well as processing time in matching fingerprints. We aim to split the identity image into well separated partitions in order to simplify the identification task. Our system will use the gray-scale information of the fingerprints. The system will select the primary fingerprint, perform feature extraction & feature matching to identify the image in the database by comparing the feature values of both the fingerprints. Our implementation mainly incorporates image pre-processing, image partitioning, image binarization, feature extraction and feature matching. It finally generates a score which tells whether two fingerprints match or not

    Evaluating syntax-driven approaches to phrase extraction for MT

    In this paper, we examine a number of different phrase segmentation approaches for Machine Translation and how they perform when used to supplement the translation model of a phrase-based SMT system. This work represents a summary of a number of years of research carried out at Dublin City University in which it has been found that improvements can be made using hybrid translation models. However, the level of improvement achieved is dependent on the amount of training data used. We describe the various approaches to phrase segmentation and combination explored, and outline a series of experiments investigating the relative merits of each method

    English-Hindi transliteration using context-informed PB-SMT: the DCU system for NEWS 2009

    This paper presents English–Hindi transliteration in the NEWS 2009 Machine Transliteration Shared Task, adding source context modeling into state-of-the-art log-linear phrase-based statistical machine translation (PB-SMT). Source context features enable us to exploit source similarity in addition to target similarity, as modelled by the language model. We use a memory-based classification framework that enables efficient estimation of these features while avoiding data sparseness problems. We carried out experiments both at character and transliteration unit (TU) level. Position-dependent source context features produce significant improvements in terms of all evaluation metrics

    Phrase extraction and rescoring in statistical machine translation

    The lack of linguistically motivated translation units or phrase pairs in Phrase-based Statistical Machine Translation (PB-SMT) systems is a well-known source of error. One approach to minimise such errors is to supplement the standard PB-SMT models with phrase pairs extracted from parallel treebanks (linguistically annotated and aligned corpora). In this thesis, we extend the treebank-based phrase extraction framework with percolated dependencies – a hitherto unutilised knowledge source – and evaluate its usability through more than a dozen syntax-aware phrase extraction models. However, the improvement in system performance is neither consistent nor conclusive despite the proven advantages of linguistically motivated phrase pairs. This leads us to hypothesize that the PB-SMT pipeline is flawed as it often fails to access perfectly good phrase-pairs while searching for the highest scoring translation (decoding). A model error occurs when the highest-probability translation (actual output of a PB-SMT system) according to a statistical machine translation model is not the most accurate translation it can produce. In the second part of this thesis, we identify and attempt to trace these model errors across state-of-the-art PB-SMT decoders by locating the position of oracle translations (the translation most similar to a reference translation or expected output of a PB-SMT system) in the n-best lists generated by a PB-SMT decoder. We analyse the impact of individual decoding features on the quality of translation output and introduce two rescoring algorithms to minimise the lower ranking of oracles in the n-best lists. Finally, we extend our oracle-based rescoring approach to a reranking framework by rescoring the n-best lists with additional reranking features. 
    We observe limited but optimistic success and conclude by speculating on how our oracle-based rescoring of n-best lists can help the PB-SMT system (supplemented with multiple treebank-based phrase extractions) get optimal performance out of linguistically motivated phrase pairs

    RECENT ADVANCEMENTS IN EAR BIOMETRICS: A REVIEW

    Ascertaining the identity of a person is an important aspect of Forensic Science. Many physiological features have been proved to be highly discriminating among individuals. Biometrics play a significant role in individualizing a person. Fingerprint, palm print, retina and iris recognition are the most popular examples. Fingerprint and iris are generally considered to allow more accurate biometric recognition than the face, but the face is more easily used in surveillance scenarios where fingerprint and iris capture are not feasible. However, the face by itself is not yet as accurate and flexible as desired for this scenario due to expression changes, source of illumination, make-up, etc. Besides these limitations, ear images can be acquired in a similar manner to face images. A number of researchers have suggested that the human ear is unique enough to each individual to allow practical use as a biometric. In this article an attempt has been made to review the research of the past decade in the field of Ear Biometrics

    MATREX: the DCU MT system for WMT 2010

    This paper describes the DCU machine translation system in the evaluation campaign of the Joint Fifth Workshop on Statistical Machine Translation and Metrics in ACL-2010. We describe the modular design of our multi-engine machine translation (MT) system with particular focus on the components used in this participation. We participated in the English–Spanish and English–Czech translation tasks, in which we employed our multi-engine architecture to translate. We also participated in the system combination task, which was carried out by the MBR decoder and confusion network decoder

    Tapadóir: developing a statistical machine translation engine and associated resources for Irish

    Tapadóir (from the Irish tapa ‘fast’ and the nominal suffix -óir) is a statistical machine translation (SMT) project, funded by the Irish government. This work was commissioned to help government translators meet the translation demands which have arisen from the Irish language’s status as an official EU and national language. The development of this system, which translates English into Irish (a morphologically rich, low-resourced minority language), has produced an interesting set of challenges. These challenges have inspired a creative response to the lack of data and NLP tools available for the Irish language and have also resulted in the development of new resources for the Irish linguistic and NLP community. We show that our SMT system out-performs Google Translate™ (a widely used general-domain SMT system) as a result of steps we have taken to tailor translation output to the user’s specific needs